In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

Chicago data

The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of open_cp.sources.chicago See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

In this notebook, we quickly look at the data, check that the data agrees between both sources, and demo some of the library features provided for loading the data.


In [2]:
import open_cp.sources.chicago as chicago
import geopandas as gpd

import sys, os, csv, lzma
filename = os.path.join("..", "..", "open_cp", "sources", "chicago.csv")
filename_all = os.path.join("..", "..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "..", "open_cp", "sources", "chicago_all1.csv.xz")

Let us look at the snapshot of the last year, vs the total dataset. The data appears to be the same, though the exact format changes.


In [3]:
with open(filename, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))


['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']
['HZ560767', '12/22/2016 02:55:00 AM', '010XX N CENTRAL PARK AVE', '4387', 'OTHER OFFENSE', 'VIOLATE ORDER OF PROTECTION', 'APARTMENT', 'N', 'Y', '1112', '27', '26', '1152189', '1906649', '41.899712716', '-87.716454159', '(41.899712716, -87.716454159)']

In [4]:
with lzma.open(filename_all, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))


['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']
['8651563', 'HV322174', '06/05/2012 11:00:00 AM', '022XX N CANNON DR', '0810', 'THEFT', 'OVER $500', 'STREET', 'false', 'false', '1814', '018', '43', '7', '06', '1175057', '1915111', '2012', '02/04/2016 06:33:39 AM', '41.922450893', '-87.632206293', '(41.922450893, -87.632206293)']

As well as loading data directly into a TimedPoints class, we can process a sub-set of the data to GeoJSON, or straight to a geopandas dataframe (if geopandas is installed).


In [5]:
geo_data = chicago.load_to_GeoJSON()
geo_data[0]


Out[5]:
{'geometry': {'coordinates': [-87.716454159, 41.899712716], 'type': 'Point'},
 'properties': {'address': '010XX N CENTRAL PARK AVE',
  'case': 'HZ560767',
  'crime': 'OTHER OFFENSE',
  'location': 'APARTMENT',
  'timestamp': '2016-12-22T02:55:00',
  'type': 'VIOLATE ORDER OF PROTECTION'},
 'type': 'Feature'}

In [6]:
frame = chicago.load_to_geoDataFrame()
frame.head()


Out[6]:
address case crime geometry location timestamp type
0 010XX N CENTRAL PARK AVE HZ560767 OTHER OFFENSE POINT (-87.71645415899999 41.899712716) APARTMENT 2016-12-22T02:55:00 VIOLATE ORDER OF PROTECTION
1 051XX S WASHTENAW AVE HZ561134 BATTERY POINT (-87.691539994 41.800445234) RESIDENTIAL YARD (FRONT/BACK) 2016-12-22T11:17:00 AGGRAVATED: OTHER FIREARM
2 059XX W DIVERSEY AVE HZ565584 DECEPTIVE PRACTICE POINT (-87.774165121 41.931166274) RESIDENCE 2016-12-09T12:00:00 FINANCIAL IDENTITY THEFT $300 AND UNDER
3 001XX N STATE ST HZ561772 THEFT POINT (-87.62787669799999 41.883500187) DEPARTMENT STORE 2016-12-22T18:50:00 RETAIL THEFT
4 008XX N MICHIGAN AVE HZ561969 THEFT POINT (-87.624095634 41.897982937) SMALL RETAIL STORE 2016-12-22T19:20:00 RETAIL THEFT

Explore with QGIS

We can save the dataframe to a shape-file which can be viewed in e.g. QGIS.

To explore the spatial-distribution, I would recommend using an interactive GIS package. Using QGIS (free and open source) you can easily add a basemap using GoogleMaps or OpenStreetMap, etc. See http://maps.cga.harvard.edu/qgis/wkshop/basemap.php

I found this to be slightly buggy. On Windows, QGIS 2.18.7 I found that the following worked:

  • First open the chicago.shp file produced from the line above.
  • Select the Coordinate reference system "WGS 84 / EPSG:4326"
  • Now go to the menu "Web" -> "OpenLayers plugin" -> Whatever
  • The projection should change to EPSG:3857. The basemap will obscure the point map, so in the "Layers Panel" drag the basemap to the bottom.
  • Selecting EPSG:3857 at import time doesn't seem to work (which is different to the instructions..!)

In [7]:
# On my Windows install, if I don't do this, I get a GDAL error in
# the Jupyter console, and the resulting ".prj" file is empty.
# This isn't critical, but it confuses QGIS, and you end up having to
# choose a projection when loading the shape-file.
import os
os.environ["GDAL_DATA"] = "C:\\Users\\Matthew\\Anaconda3\\Library\\share\\gdal\\"

frame.to_file("chicago")

A geoPandas example

Let's use the "generator of GeoJSON" option shown above to pick out only BURGLARY crimes from the 2001-- dataset (which is too large to easily load into a dataframe in one go).


In [8]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "THEFT" ]
    
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()


Out[8]:
address case crime geometry location timestamp type
0 022XX N CANNON DR HV322174 THEFT POINT (-87.632206293 41.922450893) STREET 2012-06-05T11:00:00 OVER $500
1 077XX S EUCLID AVE HV326989 THEFT POINT (-87.57727330199999 41.754642143) RESIDENCE 2012-05-01T01:00:00 AGG: FINANCIAL ID THEFT
2 068XX S WASHTENAW AVE HV326720 THEFT POINT (-87.690703078 41.769306564) APARTMENT 2012-06-09T02:00:00 FROM BUILDING
3 019XX N CICERO AVE HV326202 THEFT POINT (-87.746056778 41.915729613) DEPARTMENT STORE 2012-06-08T16:30:00 $500 AND UNDER
4 034XX N HALSTED ST HV327230 THEFT POINT (-87.64941621200001 41.944953005) BAR OR TAVERN 2012-06-10T02:30:00 POCKET-PICKING

In [9]:
frame.to_file("chicago_all_theft")

In [10]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "BURGLARY" ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()


Out[10]:
address case crime geometry location timestamp type
0 003XX W 94TH PL HV327134 BURGLARY POINT (-87.632356731 41.723006487) RESIDENCE 2012-06-09T23:30:00 FORCIBLE ENTRY
1 011XX N HERMITAGE AVE HV327217 BURGLARY POINT (-87.671145473 41.902514714) RESIDENCE 2012-06-06T12:00:00 FORCIBLE ENTRY
2 047XX W WEST END AVE HV327110 BURGLARY POINT (-87.743935241 41.883144607) RESIDENCE 2012-06-09T01:30:00 FORCIBLE ENTRY
3 038XX S SACRAMENTO AVE HV327153 BURGLARY POINT (-87.69951882399999 41.823319195) APARTMENT 2012-06-09T12:00:00 UNLAWFUL ENTRY
4 013XX W 91ST ST HV327223 BURGLARY POINT (-87.65724045 41.72858934) RESIDENCE 2012-06-09T02:00:00 FORCIBLE ENTRY

In [11]:
frame.to_file("chicago_all_burglary")

In [12]:
frame["type"].unique()


Out[12]:
array(['FORCIBLE ENTRY', 'UNLAWFUL ENTRY', 'ATTEMPT FORCIBLE ENTRY',
       'HOME INVASION'], dtype=object)

In [13]:
frame["location"].unique()


Out[13]:
array(['RESIDENCE', 'APARTMENT', 'DEPARTMENT STORE', 'RESIDENCE-GARAGE',
       'SCHOOL, PUBLIC, BUILDING', 'SMALL RETAIL STORE',
       'CONSTRUCTION SITE', 'CHA PARKING LOT/GROUNDS',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'APPLIANCE STORE',
       'RESTAURANT', 'CHA APARTMENT', 'CLEANING STORE',
       'GROCERY FOOD STORE', 'OTHER', 'DAY CARE CENTER',
       'RESIDENTIAL YARD (FRONT/BACK)', 'COMMERCIAL / BUSINESS OFFICE',
       'ABANDONED BUILDING', 'BARBERSHOP',
       'PARKING LOT/GARAGE(NON.RESID.)', 'WAREHOUSE', 'VACANT LOT/LAND',
       'BAR OR TAVERN', 'PARK PROPERTY', 'CTA GARAGE / OTHER PROPERTY',
       'RESIDENCE PORCH/HALLWAY', 'BANK', 'FEDERAL BUILDING', 'STREET',
       'OTHER RAILROAD PROP / TRAIN DEPOT', 'GAS STATION',
       'MEDICAL/DENTAL OFFICE', 'CAR WASH', 'MOVIE HOUSE/THEATER',
       'CONVENIENCE STORE', 'HOTEL/MOTEL', 'DRIVEWAY - RESIDENTIAL',
       'SCHOOL, PRIVATE, BUILDING', 'ALLEY', 'SCHOOL, PUBLIC, GROUNDS',
       'CURRENCY EXCHANGE', 'HOSPITAL BUILDING/GROUNDS',
       'COLLEGE/UNIVERSITY GROUNDS', 'FACTORY/MANUFACTURING BUILDING',
       'TAVERN/LIQUOR STORE', 'GOVERNMENT BUILDING/PROPERTY',
       'VEHICLE NON-COMMERCIAL', 'LIBRARY', 'VEHICLE-COMMERCIAL',
       'POOL ROOM', 'SCHOOL, PRIVATE, GROUNDS',
       'LAKEFRONT/WATERFRONT/RIVERBANK', 'DRUG STORE', 'CEMETARY',
       'NURSING HOME/RETIREMENT HOME', 'SIDEWALK',
       'COLLEGE/UNIVERSITY RESIDENCE HALL', 'ATHLETIC CLUB',
       'BOAT/WATERCRAFT', 'ANIMAL HOSPITAL', 'COIN OPERATED MACHINE',
       'FIRE STATION', 'CHA HALLWAY/STAIRWELL/ELEVATOR', 'FOREST PRESERVE',
       'OTHER COMMERCIAL TRANSPORTATION', 'PAWN SHOP',
       'SPORTS ARENA/STADIUM', 'BOWLING ALLEY',
       'POLICE FACILITY/VEH PARKING LOT', 'SAVINGS AND LOAN',
       'CTA PLATFORM', 'CREDIT UNION', 'ATM (AUTOMATIC TELLER MACHINE)',
       'NEWSSTAND', 'CTA TRACKS - RIGHT OF WAY', 'CTA STATION', '',
       'HIGHWAY/EXPRESSWAY', 'AIRPORT/AIRCRAFT', 'BRIDGE',
       'DELIVERY TRUCK', 'AIRPORT TERMINAL UPPER LEVEL - SECURE AREA',
       'CTA BUS', 'TAXICAB', 'VEHICLE - OTHER RIDE SERVICE',
       'AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA', 'CTA BUS STOP',
       'CTA TRAIN', 'AIRPORT EXTERIOR - NON-SECURE AREA'], dtype=object)

Upon loading into QGIS to visualise, we find that the 2001 data seems to be geocoded in a different way... The events are not on the road, and the distribution looks less artificial. Let's extract the 2001 burglary data, and then the all the 2001 data, and save.


In [14]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["timestamp"].startswith("2001") ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()


Out[14]:
address case crime geometry location timestamp type
0 079XX S CAMPBELL AVE G397434 HOMICIDE POINT (-87.685268108 41.749135591) STREET 2001-08-09T19:30:00 FIRST DEGREE MURDER
1 048XX W KAMERLING AVE G473112 HOMICIDE POINT (-87.746889161 41.905072512) STREET 2001-08-10T01:04:00 FIRST DEGREE MURDER
2 009XX N LAMON AVE G477463 HOMICIDE POINT (-87.74830887500001 41.897236716) STREET 2001-08-11T23:06:00 FIRST DEGREE MURDER
3 036XX W OHIO ST G476774 HOMICIDE POINT (-87.717959287 41.891770726) APARTMENT 2001-08-12T18:17:00 FIRST DEGREE MURDER
4 019XX S RACINE AVE G477822 HOMICIDE POINT (-87.656374568 41.85626764) AUTO 2001-08-12T04:35:00 FIRST DEGREE MURDER

In [15]:
frame.to_file("chicago_2001")

Explore rounding errors

We check the following:

  • The X and Y COORDINATES fields (which we'll see, in a different notebook, at longitude / latitude coordinates projected in EPSG:3435 in feet) are always whole numbers.
  • The longitude and latitude data contains at most 9 decimals places of accuracy.

In the other notebook, we look at map projections. The data is most consistent with the longitude / latitude coordinates being the primary source, and the X/Y projected coordinates being computed and rounded to the nearest integer.


In [16]:
longs, lats = [], []
xcs, ycs = [], []

with open(filename, "rt") as file:
    reader = csv.reader(file)
    header = next(reader)
    print(header)
    for row in reader:
        if len(row[14]) > 0:
            longs.append(row[14])
            lats.append(row[15])
            xcs.append(row[12])
            ycs.append(row[13])


['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']

In [17]:
set(len(x) for x in longs), set(len(x) for x in lats)


Out[17]:
({8, 9, 10, 11, 12}, {8, 9, 10, 11, 12, 13})

In [18]:
any(x.find('.') >= 0 for x in xcs), any(y.find('.') >= 0 for y in ycs)


Out[18]:
(False, False)

Repeated data

Mostly the "case" assignment is unique, but there are a few exceptions to this.


In [ ]:
import collections

with lzma.open(filename_all, "rt") as file:
    c = collections.Counter( event["properties"]["case"] for event in
                            chicago.generate_GeoJSON_Features(file, type="all") )
    
multiples = set( key for key in c if c[key] > 1 )
len(multiples)

In [ ]:
with lzma.open(file_all, "rt") as file:
    data = gpd.GeoDataFrame.from_features(
        event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["case"] in multiples
        )
    
len(data), len(data.case.uniques())